Abstract. S.Co.P.E. is one of the four projects funded by the Italian Government in order to
provide Southern Italy with a distributed computing infrastructure for fundamental science.
Besides building the infrastructure, S.Co.P.E. is also actively pursuing research in several
areas, among them astrophysics and observational cosmology. We briefly summarize the most
significant results obtained in the first two years of the project in the development of
middleware and Data Mining tools for the Virtual Observatory.

1. Introduction

S.Co.P.E. is a general purpose GRID infrastructure of the University
Federico II in Naples, funded through the Italian National Plan (PON)
by the Italian Government to support both fundamental research and
small and medium-sized companies. The infrastructure has been conceived
as a metropolitan GRID embedding different, heterogeneous and in some
cases pre-existing computing centers, each with its specific vocation:
high energy physics, astrophysics, bioinformatics, chemistry and
material sciences, electrical engineering, social sciences. Its
intrinsically multi-disciplinary nature makes S.Co.P.E. an ideal test
bed for innovative middleware solutions and for interoperable tools and
applications finely tuned to the needs of a distributed computing
environment. In what follows we briefly outline the main activities in
the fields of astrophysics and observational cosmology and, in
particular, we focus on: i) the ongoing efforts aimed at integrating
the S.Co.P.E. GRID (hereafter SG) with the international Virtual
Observatory (Sect. 2), and ii) the implementation in the SG of the data
mining (DM) VO-Neural package (Sect. 3), which is developed in the
framework of a collaboration with the Dept. of Astronomy at Caltech.
In Sect. 4 we briefly describe a template scientific application and,
finally, in Sect. 5, we outline some future developments.



2. The VOb and the GRID 

The Virtual Observatory (VOb) is an international effort coordinated
through the International Virtual Observatory Alliance (IVOA: URL.1)
aimed at: i) federating and making interoperable all astronomical data
archives produced by both ground based and space borne instruments;
ii) deploying a new generation of science applications or tools which
use VOb protocols for exploratory data analysis and for the extraction
of knowledge from massive data sets. The VOb is inherently distributed:
data collections remain with their providers and are accessed through
standard interfaces. Access to the data takes place through a registry
which contains information about data sets, archives, catalogs,
surveys, and computational services that can be accessed through VOb
interfaces (Hanish & De Young 2008). While the federation and fusion of
heterogeneous data archives and the implementation of flexible data
reduction and data analysis tools have been widely addressed and, at
least in their fundamental aspects, solved, the need to access large
distributed computing facilities to perform computing intensive tasks
has not yet been satisfactorily addressed.

One problem to be solved is the conflict between the VOb and the GRID
security procedures: most users of a specific Virtual Organization (VO)
do not possess the personal certificates which are required to access
the GRID or, even when they do have a personal certificate, the
computing GRID which they need does not recognize their certification
authority.

In the framework of the VONeural project (Sect. 3) and in order to make
our Data Mining (DM) tools accessible to the wider community, we
implemented and tested a general purpose interface between the
UK-ASTROGRID (hereafter AG: URL.4) and the SG.

2.1. GRID-Launcher v. 1.0 

The UK based ASTROGRID is one of the most robust astronomical Virtual
Organizations implemented so far and represents a good ground for
testing innovative solutions. The main problem we had to face was that
most users who are recognized by the AG User Authentication Service do
not possess a personal GRID certificate and therefore cannot access
distributed computing resources. This problem can be at least in part
circumvented by offering the applications as web services to be
consumed on the Grid via a service certificate (or "robot"
certificate). At the time GRID-Launcher v. 1.0 was developed, this
option had not yet been formally accepted by the EGEE-2 boards and we
were obliged to implement a test version which makes use of a personal
GRID certificate (signed by the INFN-GRID CA) which is recognized by
the S.Co.P.E. GRID.

Fig. 1. Grid Launcher v. 1.0 workflows for input and output. UI: user
interface; RB: resource broker; SE: storage element; CE: computing
element; WN: working node. Upper panel: input flow; lower panel: output
flow.

Schematically, GRID-Launcher works as summarized in Fig. 1:

- It takes the user input from the User Interface of the ASTROGRID
  Desktop, collects all the files, tables and programs needed and
  automatically generates three scripts to be executed on the GRID:
  task.sh, task.jdl and wn_runner.sh;

- it wraps them in an archive and sends it to the GRID UI
  (authentication takes place through public key exchange);

- the UI unpacks the archive, copies the data to the Storage Element
  (SE), copies wn_runner.sh to the WNs, and starts task.sh and
  task.jdl;

- wn_runner.sh starts on the WNs, takes the data from the SE, starts
  the application and puts the results on the SE. The GRID
  automatically generates two output files, task.err and task.out, and
  sends them to the UI using the Output SandBox;

- GRID-Launcher periodically checks the status of the job and, when it
  ends, moves the results from the UI to the ASTROGRID machine.
  GRID-Launcher receives the data archive, unpacks it and puts the
  results into the AG Myspace (VO-Space).
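
The first of the steps above, the automatic generation of task.sh,
task.jdl and wn_runner.sh, can be illustrated with a minimal Python
sketch. It is not the GRID-Launcher code: the function names
make_job_files and pack, the contents of the scripts and the commands
fetch_from_se, store_to_se and submit_job are hypothetical placeholders
introduced here for illustration. Only the three file names and the
task.out / task.err outputs come from the text; the JDL attributes are
the usual ones of gLite job descriptions.

```python
import io
import tarfile


def make_job_files(app_cmd, data_files):
    """Return {file name: text} for the three scripts named in the text.

    app_cmd and data_files are hypothetical inputs standing in for what
    the user selects in the ASTROGRID Desktop.
    """
    # wn_runner.sh runs on the worker node: fetch data, run, store results.
    # fetch_from_se / store_to_se are placeholders, not real commands.
    runner = ["#!/bin/sh"]
    runner += [f"fetch_from_se {name}" for name in data_files]
    runner += [app_cmd, "store_to_se results.tar.gz"]

    # task.jdl describes the job to the resource broker.
    jdl = [
        'Executable = "wn_runner.sh";',
        'StdOutput = "task.out";',
        'StdError = "task.err";',
        'InputSandbox = {"wn_runner.sh"};',
        'OutputSandbox = {"task.out", "task.err"};',
    ]

    # task.sh runs on the UI: stage the data, then submit the job.
    # submit_job is a placeholder for the middleware submit command.
    task = ["#!/bin/sh"]
    task += [f"store_to_se {name}" for name in data_files]
    task += ["submit_job task.jdl"]

    return {
        "wn_runner.sh": "\n".join(runner) + "\n",
        "task.jdl": "\n".join(jdl) + "\n",
        "task.sh": "\n".join(task) + "\n",
    }


def pack(files):
    """Wrap the generated scripts in an in-memory tar.gz archive (bytes)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical application command and input file.
files = make_job_files("./vo_neural_svm train.dat", ["train.dat"])
archive_bytes = pack(files)
```

In this sketch the archive is built in memory; how it is shipped to the
UI and how the job status is polled are left out, since the text does
not describe them in enough detail.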

So far, GRID-Launcher v. 1.0 has been implemented and tested on a
handful of applications: VO-Neural_MLP and VO-Neural_SVM (cf. Sect. 3),
SExtractor (Bertin & Arnouts 1996) and SWarp (URL.5).



3. VO-Neural

As mentioned above, in the last decade many national (cf. URL.2) and
international (cf. URL.3) projects have solved many problems related to
the federation of heterogeneous data sets, while much remains to be
done as far as tools and user interfaces are concerned. One of the main
open issues is the implementation of scalable and user friendly data
mining tools capable of dealing with the huge VOb data sets.

VO-Neural is a data mining framework capable of working on massive
(> 1 TB) data sets (catalogues) in a distributed computing environment
matching the IVOA standards and requirements. VO-Neural is the
evolution of the AstroNeural project (Tagliaferri et al. 2003), which
was started in 1994 as a collaboration between the Department of
Mathematics and Applications at the University of Salerno and the
Astronomical Observatory of Capodimonte-INAF, and is currently under
continuous evolution. VO-Neural allows users to extract from large
datasets information useful to determine patterns, relationships,
similarities and regularities in the space of parameters, and to
identify outliers. In its final version, it will be accessible both as
a web application and through the AG Desktop, and will offer core
processing features such as exploratory data analysis and data
prediction, together with ancillary functionality such as fine tuning,
visual exploration of the main characteristics of the datasets, etc.
Besides offering the possibility to use the individual routines to
perform specific tasks, VO-Neural will provide users with a complete
framework to write their own customized programs. In the next two
paragraphs we briefly outline the main features of two supervised
clustering models already included in the package, which have already
been used on the S.Co.P.E. GRID for specific science applications.

3.0.1. VONeural_MLP

VONeural_MLP is an implementation of a standard Multi Layer Perceptron
based on the FANN (Fast Artificial Neural Networks) Library (URL.6),
written in C++ (Skordovski 2008). The algorithm known as Multi Layer
Perceptron (MLP) is based on the concept of the perceptron, and
learning is based on the gradient-descent method, which finds a local
minimum of a function in a space with N dimensions. The weights
associated with the connections between the layers of neurons are
initialized to small random values, and then the MLP applies the
learning rule using part of the template patterns. Once convergence has
been achieved and a validation procedure has been applied in order to
avoid overfitting, the performance of the network is evaluated on a
disjoint test set extracted from the template patterns. The resulting
network is then applied to the original data.
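
The training scheme just described (small random initial weights,
gradient descent on part of the template patterns, a validation step
against overfitting, evaluation on a disjoint test set) can be
illustrated with a minimal pure-Python sketch. It does not use the FANN
library and is not the VONeural_MLP code. The network size, learning
rate, 60/20/20 split, toy data and the choice of "keep the weights with
the lowest validation error" as validation procedure are all
assumptions made for illustration.

```python
import copy
import math
import random


def split_patterns(patterns, seed=0):
    """Shuffle and split into disjoint train / validation / test sets (60/20/20)."""
    rng = random.Random(seed)
    shuffled = patterns[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 3 // 5
    n_valid = len(shuffled) // 5
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def forward(weights, x):
    """One hidden tanh layer and one sigmoid output unit; x is a list of floats."""
    w_hidden, w_out = weights
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0])))
              for row in w_hidden]
    z = sum(w * h for w, h in zip(w_out, hidden + [1.0]))
    return hidden, 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, data):
    """Mean squared error over a list of (features, target) pairs."""
    return sum((forward(weights, x)[1] - t) ** 2 for x, t in data) / len(data)


def train_mlp(train, valid, n_hidden=4, rate=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    n_in = len(train[0][0])
    # Weights (last entry of each row is the bias) start small and random.
    weights = ([[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)],
               [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)])
    best, best_loss = copy.deepcopy(weights), mean_loss(weights, valid)
    for _ in range(epochs):
        for x, t in train:  # one gradient-descent step per training pattern
            w_hidden, w_out = weights
            hidden, y = forward(weights, x)
            delta = 2.0 * (y - t) * y * (1.0 - y)  # d(loss)/d(output input)
            for j, h in enumerate(hidden):
                delta_j = delta * w_out[j] * (1.0 - h * h)
                for k, v in enumerate(x + [1.0]):
                    w_hidden[j][k] -= rate * delta_j * v
                w_out[j] -= rate * delta * h
            w_out[-1] -= rate * delta  # bias of the output unit
        # Validation step: remember the weights that generalize best.
        loss = mean_loss(weights, valid)
        if loss < best_loss:
            best, best_loss = copy.deepcopy(weights), loss
    return best, best_loss


# Toy template patterns: label 1 when the two features sum to more than 1.
rng = random.Random(1)
patterns = []
for _ in range(50):
    x = [rng.random(), rng.random()]
    patterns.append((x, 1.0 if x[0] + x[1] > 1.0 else 0.0))

train, valid, test = split_patterns(patterns)
network, best_valid_loss = train_mlp(train, valid)
test_loss = mean_loss(network, test)  # performance on the disjoint test set
```

The test set is touched only once, after training and validation, which
is what makes the final figure an unbiased estimate of the performance
on new data.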
3.0.2. VONeural_SVM

VONeural_SVM is an implementation of Support Vector Machines (Russo
2007; Cavuoti 2008) based on the LIBSVM library (URL.7). Support Vector
Machines classify records by first mapping the data into a space of
higher dimensionality and then using a set of template vectors
(targets) to find in this new space the separating hyperplane with the
largest margin. Without entering into details (Boser et al. 1992), we
just recall that, in the case of the C-SVC implemented with the RBF
(Radial Basis Function) kernel, the position of this hyperplane depends
on two parameters (C and γ) which cannot be estimated in advance but
need to be evaluated by finding the maximum of the classification
performance over a grid of values, which is usually defined by letting
C and γ vary as C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ...,
2^3. Due to their computational weight, and to the need to run many
iterations for different pairs of the two parameters, SVMs are ideally
suited to being run on the GRID.
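
The grid search over the two parameters can be sketched in a few lines
of Python. This is an illustration only, not the VONeural_SVM code and
not a call into LIBSVM: the exponent ranges are assumed to be the ones
suggested in the LIBSVM practical guide (C = 2^-5 ... 2^15,
γ = 2^-15 ... 2^3, in steps of 2 in the exponent), and toy_score is a
hypothetical stand-in for the cross-validated accuracy that a real run
would compute by training an SVM for each pair.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def parameter_grid(c_exponents=range(-5, 16, 2), gamma_exponents=range(-15, 4, 2)):
    """All (C, gamma) pairs on a power-of-two grid (assumed ranges)."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in product(c_exponents, gamma_exponents)]


def grid_search(score, grid, workers=4):
    """Evaluate score(C, gamma) for every pair and return the best one.

    Each evaluation is independent of the others, so the search maps
    naturally onto one job per pair; a thread pool stands in here for
    the worker nodes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda pair: score(*pair), grid))
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]


def toy_score(c, gamma):
    """Hypothetical score surface with its maximum at C = 8, gamma = 0.5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 1) ** 2)


grid = parameter_grid()
best_pair, best_score = grid_search(toy_score, grid)
```

Under the assumed ranges the grid contains 11 x 10 = 110 pairs, each of
which requires a full training run; since no pair depends on any other,
the total wall-clock time scales down almost linearly with the number
of nodes available.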



4. An application to the classification 
of AGN in the SDSS 

The astronomical community is used to performing DM tasks in a sort of
"hidden" way (cf. the selection of specific objects in a color-color
diagram) but it has not yet become familiar with the potential of more
advanced tools such as those described here. This is mainly because
these tools are often anything but user friendly and require an
in-depth understanding of the (often complex) theory lying behind them,
a complexity which often discourages potential users. Therefore, a
crucial aspect of the project is the application to challenging
problems capable of exemplifying the new science which will emerge from
the adoption of a less conservative approach to the analysis of the
data. Two science cases, namely the evaluation of photometric redshifts
(a regression and classification problem based on the use of
VONeural_MLP) and the selection of candidate quasars in the Sloan
Digital Sky Survey (cf. Stoughton et al. 2002), based on the use of
unsupervised clustering algorithms and agglomerative clustering, have
already been published in the literature (D'Abrusco et al. 2007;
D'Abrusco et al. 2008). We shall therefore focus on the application of
VONeural_SVM to the classification of AGNs.

The classification of AGN is usually performed on their overall
spectral distribution using spectroscopic indicators (equivalent line
widths, FWHM of specific lines or line flux ratios) and diagnostic
diagrams (usually called BPT) which are difficult and time consuming to
derive. In these diagrams AGN and non-AGN are empirically separated by
lines derived either from theory or from empirical laws, such as those
derived by Kewley et al. (2006), Kauffman et al. (2003) and Heckman
(1980). A reliable and accurate AGN classifier based on photometric
features only would save precious telescope time and enable several
studies based on statistically significant samples of objects. We
therefore used a supervised clustering of the photometric data,
exploiting the information contained in a spectroscopic Base of
Knowledge (BoK) derived from available catalogues. We wish to stress
that, since neural networks have no power of extrapolation, all the
biases in the BoK are reproduced, and therefore the BoK needs to be as
complete and bias-free as possible. As classification tools we used
both the MLP and, due to the intrinsically binary nature of the problem
(AGN against non-AGN, Seyfert 1 against Seyfert 2, etc.), also the SVM.
The BoK was obtained from the fusion of two catalogues:



- the catalogue of Sorrentino et al. (2006), who separated objects
  into Seyfert 1, Seyfert 2 and "Not AGN" using the Kewley lines
  (Kewley et al. 2006);

- a catalogue derived by us from the SDSS spectroscopic archive using
  the criteria introduced by Kauffman et al. (2003), in which objects
  are classified as AGN, not AGN, and "mixed". The Mix and Pure AGN
  zones were further divided into Seyfert and LINERs by using the
  Heckman line (Heckman 1980).
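
The three-way labelling used for the BoK (AGN, not AGN, "mixed") can be
illustrated with a short Python sketch based on the commonly quoted
forms of the two demarcation curves in the [OIII]/Hβ versus [NII]/Hα
diagram: the empirical curve of Kauffman et al. (2003) and the
theoretical curve adopted by Kewley et al. (2006). The text does not
give the equations, so the coefficients below, the function names and
the label strings are assumptions for illustration; the cuts actually
used to build the BoK may differ, and the further Seyfert/LINER split
is not sketched.

```python
def kauffman_line(x):
    """Empirical boundary, log([OIII]/Hb) as a function of x = log([NII]/Ha).

    Defined for x < 0.05.
    """
    return 0.61 / (x - 0.05) + 1.3


def kewley_line(x):
    """Theoretical boundary in the same diagram, defined for x < 0.47."""
    return 0.61 / (x - 0.47) + 1.19


def bpt_class(log_nii_ha, log_oiii_hb):
    """Label a galaxy from its two emission-line ratios (both in log10)."""
    x, y = log_nii_ha, log_oiii_hb
    if x < 0.05 and y < kauffman_line(x):
        return "not AGN"   # below both curves
    if x < 0.47 and y < kewley_line(x):
        return "mixed"     # between the two curves
    return "AGN"           # above the upper curve


labels = [bpt_class(-0.5, -0.5), bpt_class(-0.2, 0.0), bpt_class(0.0, 1.0)]
```

Both ratios require spectroscopy, which is why a classifier trained on
such labels but fed with photometric features only is attractive.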
experiment        BoK                                algorithm  efficiency  completeness

AGN vs Mix        BPT plot + Kewley line             MLP        76%         54%
                  BPT plot + Kewley line             SVM        74%         55%

Type 1 vs 2       BPT plot + Kewley line             MLP        95%         ~100%
                  BPT plot + Kewley line             SVM        82%         98%

Seyfert vs LINER  BPT plot + Heckman & Kewley lines  MLP        80%         92%
                  BPT plot + Kewley line             SVM        78%         89%


Table 1. Summary of the results of supervised classification experiments performed using both
VONeural_MLP and VONeural_SVM.
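
The two figures of merit in Table 1 are not defined in the text. A
minimal Python sketch of one common convention is given below, under
the assumption that "efficiency" is the fraction of all objects that
are correctly classified and "completeness" is the fraction of objects
of the target class that are recovered as such. The function name and
the toy labels are hypothetical, and the definitions actually adopted
for Table 1 may differ.

```python
def efficiency_and_completeness(true_labels, predicted_labels, target):
    """Return (efficiency, completeness) under the assumed definitions.

    efficiency   = correctly classified objects / all objects
    completeness = target objects classified as target / all target objects
    """
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == p)
    n_target = sum(1 for t, _ in pairs if t == target)
    target_found = sum(1 for t, p in pairs if t == target and p == target)
    return correct / len(pairs), target_found / n_target


# Toy example: five objects from the BoK and the classifier output.
truth = ["AGN", "AGN", "AGN", "Mix", "Mix"]
predicted = ["AGN", "AGN", "Mix", "Mix", "AGN"]
efficiency, completeness = efficiency_and_completeness(truth, predicted, "AGN")
```

In the toy example three of the five objects are classified correctly
and two of the three true AGN are recovered.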



For all the experiments, run with both the MLP and the SVM, we used the
same set of features (for a definition refer to the SDSS
specifications) extracted from the SDSS database: petroR50_u,
petroR50_g, petroR50_r, petroR50_i, petroR50_z, concentration_index_r,
fibermag_r, (u - g)dered, (g - r)dered, (r - i)dered, (i - z)dered,
dered_r, together with the photometric redshift in (D'Abrusco et al.
2008). We performed three types of classification experiments: AGN vs
Mix, Type 1 vs Type 2, Seyfert vs LINER. The experiments with SVM were
performed on the SG using 110 worker nodes. In order to test the
interoperability of the four PON projects, the 110 nodes were taken
from the Napoli, Catania and Cagliari PON locations. Results are
summarized in Table 1 and, as can be seen, the use of machine learning
tools makes it possible to reach performances which in some cases (e.g.
Type 1 vs 2 with MLPs) cannot by any means be achieved with more
traditional tools. A more detailed discussion of the results will be
presented in (Cavuoti, D'Abrusco & Longo, 2008, in preparation).

5. Future developments 

We plan to continue the development of VO-Neural and to offer it as a
web application. In more detail, we plan to deploy a general purpose
GRID-Launcher interface capable of launching any "command line" program
through a "robot certificate" (GRID-Launcher 2.0).

At the moment we are engineering the package in order to increase its
flexibility and its capability to adapt to a distributed computing
environment. We are also implementing parallel versions of some tools
which are particularly demanding in terms of computing time. We also
plan to integrate existing statistical software (such as, for instance,
the VO-STAT web application (URL.9)) within the VO-Neural interface, in
order to ensure proper statistical tools for exploratory data analysis
and for the evaluation of the results. The status of the project can be
monitored at the URL: http://voneural.na.infn.it/ .